High-speed pulse amplitude modulated (PAM) data transmission
over telephone channels is only possible when adaptive equalization
is used to mitigate the linear distortion found on the (initially unknown)
channel. At the beginning of the equalization procedure, the
tap weights are adjusted to minimize the intersymbol interference
between pulses. The "stochastic gradient" algorithm is an iterative
procedure commonly used for setting the coefficients in these and
other adaptive filters, but a proper understanding of the convergence
has never been obtained. It has been common analytical practice to
invoke an assumption stating that a certain sequence of random
vectors which direct the "hunting" of the equalizer are statistically
independent. Everyone acknowledges this assumption to be far from
true, just as everyone agrees that the final predictions made using it
are in excellent agreement with experiments and simulations. We
take the resolution of this question as our main problem. When one
begins to analyze the performance of the algorithm, one sees that the
average mean-square error after the nth iteration requires knowing,
as an intermediate step, the mathematical expectation of the product
of a sequence of statistically dependent matrices. We transform the
latter problem to a space of sufficiently high dimension where the
required average may be obtained from a canonical equation
$\Gamma_{n+1} = \mathcal{A}(\alpha)\Gamma_n + \mathcal{F}$. Here $\mathcal{A}(\alpha)$ is a square matrix, depending on the
"step-size" $\alpha$ of the original algorithm, and $\Gamma_n$ and $\mathcal{F}$ are vectors.
The mean-square error is calculable from the solution $\Gamma_n$.

Information about the solution of our equation is obtained by doing
matrix perturbation theory on $\mathcal{A}(\alpha)$ for small values of $\alpha$. We show
that the first two terms of the perturbation solution contain, among
their terms, the terms of the independence theory. Since the parameter
$\alpha$ needs to be small even for independence theory to converge, agreement
with an exact theory and experiment is obtained if, in some
sense, the additional terms which appear in the perturbation solution
may be disregarded. This will usually be the case.

I. INTRODUCTION 

Adaptive equalization of telephone channels in order to facilitate 
high-speed data transmission has been successful ever since its introduction
by Lucky in the 1960s. This technique uses a linear filter
(configured as a tapped delay line) to remove the harmful effects of
the linear channel distortion. At the start of the equalization procedure, 
a set of parameters, the tap weights, are adjusted so that the final 
setting of these taps minimizes the intersymbol interference between 
pulses in the data train. Many theoretical studies have been made 
concerning steady-state equalization after the optimum tap weights 
have been achieved; little analysis has been done concerning the 
convergence of the equalizer tap weights to their final settings. Even 
in the best published study on this problem (Ungerboeck, Ref. 1), it is 
necessary to invoke an assumption stating that a sequence of random 
vectors which direct the operation of the equalizer are statistically 
independent.† This independence assumption will be explained more
fully later; for the moment, we only indicate that it is not even 
approximately true. In fact, given the nth vector of the sequence, all 
but one component of the next vector will be exactly known. Yet if 
this assumption is made, surprising agreement with actual performance 
is obtained (Ref. 1). Clearly, because of its importance, this situation begs for
clarification. Hopefully, what we learn in equalization can be used for 
other applications where similar adaptive algorithms are used. In 
particular, the areas of linear prediction and adaptive array processing, 
both electromagnetic and sonar, come to mind. We concentrate our 
presentation on equalization, however, for here the author is sure of 
the details. 

We shall take as our performance criterion the expected value of the 
mean-square distortion, although the average error vector is also 
considered as a simpler problem. In particular, then, we are not 
concerned with the fluctuations which might occur in actual use. 



† We are here concerned with convergence in random data, not with a known specially
designed sequence. In usual startup operation, the data symbols are also assumed
known, either by using a known sequence or by assuming that sufficiently accurate
estimates are available.

Typically, the sample paths are close to the mean (see Ref. 1). In a
nutshell, our contribution to this problem consists of two parts. We 
first establish a time-independent difference equation which governs 
the average in question. This step is accomplished in a space of much 
higher dimension than one would initially assume. Second, examining 
the solution of this equation in a perturbation sense (the small "step- 
size" of the algorithm being the essential perturbation parameter), we 
find the leading terms contain the independence theory solution. 

Before delving into the abstract problem, we devote Section II to 
describing some more conventional aspects of data transmission and 
equalization and Section III to discussing the behavior of the mean- 
square error if the independence assumption is made. 

II. DATA TRANSMISSION AND EQUALIZATION 

For our own convenience, we confine the discussion to binary 
baseband transmission and neglect the effects of additive noise. 

The equalizer, and in fact the entire detection procedure, operates 
on the samples of the baseband received signal r(t), where 

$r(t) = \sum_{m} a_{m+K}\, h(t - mT)$.

If $1/T'$ is the sampling rate, $1/T$ the symbol rate, $a_n$ the data symbols
(iid, $\pm 1$ with equal probability) and $h(t)$ the overall system impulse
response, then these samples are†

$r(nT') = \sum_{m=-\infty}^{\infty} a_{m+K}\, h(nT' - mT)$,   n = 0, 1, 2, ... .   (1)

For a synchronous equalizer, $T' = T$ and for a fractionally spaced
equalizer, typically $T' = T/2$. If the coefficients of the equalizer are
denoted by $c_i$, i = 1, ..., N ($c_i$ being also the ith component of a vector
c) and the sequence of output samples of the equalizer are $y_n$, then

$y_n = \sum_{s=1}^{N} c_s\, r[(s-1)T' + nT]$,   n = 0, 1, 2, ... .   (2)

We call attention to the fact that, even when $T' \neq T$, the equalizer
samples are only of interest at multiples of the signaling interval T,
and the notation of (2) takes this into account. We define a sequence
(in time) of vectors $X^{(n)}$ such that the sth component of vector $X^{(n)}$ is

$X_s^{(n)} = r[(s-1)T' + nT]$,   s = 1, 2, ..., N,   n = 0, 1, 2, ...,   (3)

and thus

$y_n = c \cdot X^{(n)}$.   (4)

The implementation of (2) to (4) is shown in Fig. 1.

† We call the bit which "goes with" the mth pulse $a_{m+K}$ (instead of the usual $a_m$) for
later convenience.

Later, when we consider an adaptive equalizer, the taps will vary
with time and $c^{(n)}$ will be used for the sequence of tap-weight vectors.
Ideally we would like (at least when n is large enough) the sequence of
equalizer outputs to be the sequence of data symbols, except, perhaps,
for a shift. For a finite equalizer (i.e., N finite) this ideal is not
achievable, and instead the available taps are adjusted to minimize
the average square error $E e_n^2$, where

$e_n = y_n - a_{n+K}$   (5)

and E denotes the mathematical expectation with respect to the data
symbols $\{a_n\}$. If one introduces the N × N channel autocorrelation
matrix† (which is positive definite),

$A = E X^{(n)} X^{(n)T}$,   (6)

and the vector,

$v = E a_{n+K} X^{(n)}$,   (7)

both of which do not depend on the time index n, then, for fixed taps
c, the mean-squared error $\mathcal{E}$ is given by

$\mathcal{E} = E(y_n - a_{n+K})^2 = c^T A c - 2 c^T v + 1$.   (8)

Equation (8) shows $\mathcal{E}$ to be a convex quadratic function of c. Any
optimum choice of c, say, $c^*$, satisfies

$A c^* = v$,   (9)

which has a unique solution if $A^{-1}$ exists. We denote the minimum of
$\mathcal{E}$ by $\mathcal{E}^*$.

It will make little difference physically, and it will be a great 
convenience mathematically, if we pretend that the impulse response 
h(t) used in (1) has finite duration. Thus, assume 

$h(t) = 0$ if $|t| > HT$.

Let $N_1$ and $N_2$ be the largest integers such that

$N_1 T \le HT$   (10a)

$(N - 1)T' - N_2 T \ge -HT$.   (10b)

Further, choose the integer K in (1) to be $N_1$ and set $M = N_1 + N_2 + 1$,
and let $a^{(n)}$ be an M-dimensional vector whose ith component is
$a_i^{(n)} = a_{n+i-1}$, i = 1, ..., M. Then using (3) and (1) we have

$X^{(n)} = B a^{(n)}$,   (11)

where in (11) B is an N × M matrix having elements

$B_{ij} = h[(i-1)T' + (N_1 + 1 - j)T]$,   $1 \le i \le N$,   $1 \le j \le M$.   (12)

It follows from (10b) that $M \ge N$ if $T' = T$ and $M \ge (N + 1)/2$ if $T' = T/2$.

† The superscript T always denotes transpose.

Fig. 1. Adaptive transversal equalizer, N = 5.

The structure of the matrix B is illustrated below for the special
case $T' = T$, N = 3, M = 7, writing $h_k$ for $h(kT)$.

$B = \begin{pmatrix} h_2 & h_1 & h_0 & h_{-1} & h_{-2} & 0 & 0 \\ 0 & h_2 & h_1 & h_0 & h_{-1} & h_{-2} & 0 \\ 0 & 0 & h_2 & h_1 & h_0 & h_{-1} & h_{-2} \end{pmatrix}$

This structure means that $X^{(n)}$ has the same shifting property as $a^{(n)}$.
Thus, for example, in time sequence,

$(a, b, c)^T \rightarrow (b, c, d)^T \rightarrow (c, d, e)^T$, etc.

Since

$E a^{(n)} a^{(n)T} = I$,   (13)

it follows from (6), (11), and (13) that

$A = E X^{(n)} X^{(n)T} = B B^T$.   (14)

For the special case $T' = T$, $h(nT) = \delta_{n0}$, then M = N, A = I, and

$X^{(n)} = a^{(n)} = (a_n, a_{n+1}, \ldots, a_{n+N-1})^T$.   (15)
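As an illustration of (12) and (14), the following minimal sketch builds B for the synchronous case $T' = T$ and forms $A = BB^T$. The function names and the pulse samples are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def channel_matrix(h, H, N):
    """Build B of (12) for the synchronous case T' = T.

    h maps an integer k to the sample h(kT); samples with |k| > H are zero.
    """
    N1 = H              # largest integer with N1*T <= H*T, from (10a)
    N2 = N - 1 + H      # largest integer with (N-1)*T - N2*T >= -H*T, from (10b)
    M = N1 + N2 + 1

    def tap(k):
        return h.get(k, 0.0) if abs(k) <= H else 0.0

    # Rows i = 1..N and columns j = 1..M, exactly as indexed in (12).
    return [[tap(i + N1 - j) for j in range(1, M + 1)] for i in range(1, N + 1)]


def autocorrelation(B):
    """A = B B^T, as in (14)."""
    return [[sum(x * y for x, y in zip(row_i, row_j)) for row_j in B] for row_i in B]


# Hypothetical pulse samples h(kT), k = -2..2 (illustrative values only).
h = {2: 0.05, 1: -0.1, 0: 1.0, -1: 0.2, -2: -0.05}
B = channel_matrix(h, H=2, N=3)   # 3 x 7, the banded structure shown above
A = autocorrelation(B)            # 3 x 3, symmetric
```

With H = 2 and N = 3 this reproduces the 3 × 7 banded pattern of the example, and an ideal channel (`{0: 1.0}` with H = 0) gives the identity, as in (15).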



We now begin to describe the stochastic gradient algorithm used for 
equalizer convergence. But first we describe a different problem, the 
deterministic gradient algorithm, which is a method for finding the 
minimum on the surface $\mathcal{E}$, where

$\mathcal{E} = c^T A c - 2 c^T v + 1$.   (16)

This provides some heuristics for writing down the stochastic algorithm,
but should not be confused with it. We take pains to point out
some differences as we proceed, since many people substitute discussion
of this algorithm for the actual one.
Taking the gradient of (16) gives

$\nabla \mathcal{E} = 2[Ac - v]$.   (17)



Hence, if we were searching for a minimum of the function (16) by
taking steps in the gradient direction, we would write the following
equation for our position $c^{(n)}$ at the nth stage

$c^{(n+1)} = c^{(n)} - \Delta (A c^{(n)} - v)$,   (18)

$\Delta$ being a step-size parameter. Equation (18) coupled with (6) and (7)
motivates the actual stochastic gradient algorithm used, namely,

$c^{(n+1)} = c^{(n)} - \alpha [X^{(n)} (X^{(n)T} c^{(n)}) - a_{n+K} X^{(n)}]$   (19)

$\phantom{c^{(n+1)}} = c^{(n)} - \alpha e_n X^{(n)}$,   (20)

$e_n$ being the scalar error (5), and $\alpha$ the step-size.† Thus in N-dimensional
tap space we move in directions $X^{(n)}$, where $X^{(n)}$ is [see (4)] the
vector of values stored in the equalizer at time nT. Clearly, the allowed
set of directions along which we "step" is, as (15) will testify, quite
random and cannot be thought of as being gradient directions. Nevertheless,
tradition dominates, and (19) and (20) are still referred to as
a stochastic gradient algorithm.

† It is, of course, meaningless to speak of the "size" of $\alpha$ unless one fixes the size or
scaling of the terms which multiply it in (20). We shall take the scaling of the latter so
that, in the binary case, the matrix A [see (6)] has largest eigenvalue unity.

For our purposes, (19) may be rewritten slightly by introducing the
error vector

$\epsilon^{(n)} = c^{(n)} - c^*$.   (21)

Subtracting $c^*$ from both sides of (19) allows us to write

$\epsilon^{(n+1)} = (I - \alpha X^{(n)} X^{(n)T})\epsilon^{(n)} - \alpha (c^{*T} X^{(n)} - a_{n+K}) X^{(n)}$.   (22)

Note the quantity $c^{*T} X^{(n)} - a_{n+K}$ is the instantaneous error if the
optimum taps were used. This is normally quite small and would be
zero if perfect equalization were possible.
In terms of the $\epsilon^{(n)}$, the mean-square error is

$\mathcal{E}^{(n)} = \mathcal{E}^* + \epsilon^{(n)T} A \epsilon^{(n)} = \mathcal{E}^* + \tilde{\mathcal{E}}^{(n)}$.   (23)

In (23) the symbol $\tilde{\mathcal{E}}^{(n)}$ has been introduced for the excess mean-square
error over $\mathcal{E}^*$.

In (22) and (23), $\epsilon^{(n)}$ is random, and in fact depends on the entire
sequence of data symbols since the adaptation began. Our measure of
the progress of the algorithm will be $E\mathcal{E}^{(n)}$, the average of the error at
time n over all data sequences.
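The update (19), (20) can be sketched in a few lines. This is a minimal simulation under stated assumptions, not the author's program: the function name, the seed, the step-size 0.1, and the choice of the undistorted channel of (15) (so that B is the identity and K = 0) are all hypothetical.

```python
import random


def stochastic_gradient(B, K, alpha, steps, seed=0):
    """Run the tap update (20) on random binary data; return final taps and e_n**2."""
    rng = random.Random(seed)
    N, M = len(B), len(B[0])
    a = [rng.choice((1, -1)) for _ in range(steps + M)]   # data symbols a_0, a_1, ...
    c = [0.0] * N                                         # initial taps c^(0)
    squared_errors = []
    for n in range(steps):
        window = a[n:n + M]                                         # a^(n)
        X = [sum(b * s for b, s in zip(row, window)) for row in B]  # X^(n) = B a^(n), (11)
        e = sum(ci * xi for ci, xi in zip(c, X)) - a[n + K]         # scalar error (5)
        c = [ci - alpha * e * xi for ci, xi in zip(c, X)]           # update (20)
        squared_errors.append(e * e)
    return c, squared_errors


# Undistorted channel of (15): B = I, K = 0, so the optimum taps are (1, 0, 0, 0).
N = 4
B_ident = [[1.0 if i == j else 0.0 for j in range(N)] for i in range(N)]
c_final, errs = stochastic_gradient(B_ident, K=0, alpha=0.1, steps=1000)
```

Because perfect equalization is possible for this channel, the forcing term in (22) vanishes and the taps settle at $c^*$ rather than fluctuating about it.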

III. THE INDEPENDENCE THEORY 

In this section we describe "independence theory," an approximation
used to mathematically treat the stochastic gradient algorithm described
by (22). Use of the approximation allows one (as we shall see)
to determine bounds on the step-size $\alpha$ which will ensure stability and
allows calculations to be made on convergence rates.

Independence theory treats the stochastic algorithm by assuming
that the sequence $X^{(n)}$ are statistically independent vectors. Since,
from (22), $\epsilon^{(n)}$ depends only on the sequence $X^{(1)}, \ldots, X^{(n-1)}$ (assuming
we start with $X^{(1)}$), we conclude $\epsilon^{(n)}$ and $X^{(n)}$ are independent. For an
example as to how this is applied, we look at the average error vector
$E\epsilon^{(n)}$. We have, from (22), (6), (7), and (9),

$E\epsilon^{(n+1)} = (I - \alpha A) E\epsilon^{(n)}$.   (24)

If, for comparison, we introduce the error vector $c^{(n)} - c^*$ for the
deterministic theory and call it $d^{(n)}$ so no confusion can arise, we would
have, subtracting $c^*$ from both sides of (18),

$d^{(n+1)} = (I - \Delta A) d^{(n)}$.   (25)

There is no question of an average in (25); $d^{(n)}$ is the error. In (24),
$E\epsilon^{(n)}$ can be zero although the norm of $\epsilon^{(n)}$ can be quite large.

To emphasize the difference further, let us return to the simple 
model (15) which describes an undistorted channel, for which perfect 
equalization is possible. Only the initial setting of the taps is wrong. 
For this case, we have [note A = I in (23)] using (22) and the
independence assumption

$E\epsilon^{(n+1)T}\epsilon^{(n+1)} = E\epsilon^{(n)T}(I - \alpha X^{(n)}X^{(n)T})(I - \alpha X^{(n)}X^{(n)T})\epsilon^{(n)}$

$\phantom{E\epsilon^{(n+1)T}\epsilon^{(n+1)}} = (1 - 2\alpha + \alpha^2 N)\, E\epsilon^{(n)T}\epsilon^{(n)}$.   (26)

Thus the error decays to zero as

$(1 - 2\alpha + \alpha^2 N)^n\, \tilde{\mathcal{E}}^{(0)}$,   (27)

which is optimized if $\alpha = 1/N$ to give

$\left(1 - \frac{1}{N}\right)^n \tilde{\mathcal{E}}^{(0)}$.   (28)

Note how convergence is slowed as the number of taps N of the
problem increases. By contrast, if A = I in (25), choosing $\Delta = 1$ gives
convergence in one step, independent of dimension.

The convergence range of (24) for A = I is $0 < \alpha < 2$, while for (27)
it is $0 < \alpha < 2/N$. In practice, N ranges from about 7 to 64 and thus $\alpha$
is, by the requirement of convergence of the mean-square error, kept
quite small.
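To make (26) to (28) concrete, the short sketch below evaluates the per-iteration factor and the number of iterations needed for a given reduction of the excess error. The helper names and the choice N = 16 are illustrative assumptions.

```python
import math


def decay_factor(alpha, N):
    """Per-iteration factor 1 - 2*alpha + alpha**2 * N from (26)."""
    return 1.0 - 2.0 * alpha + alpha * alpha * N


def iterations_to_reduce(ratio, alpha, N):
    """Smallest n with decay_factor**n <= ratio, valid for 0 < alpha < 2/N."""
    return math.ceil(math.log(ratio) / math.log(decay_factor(alpha, N)))


N = 16
alpha_opt = 1.0 / N                          # the optimizing step-size of (28)
factor_opt = decay_factor(alpha_opt, N)      # equals 1 - 1/N
n_needed = iterations_to_reduce(1e-3, alpha_opt, N)
```

The factor returns to unity at $\alpha = 2/N$, which is the edge of the convergence range quoted for (27).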

In order to examine independence theory further, it will be convenient
to discuss the (positive definite) error matrix

$R^{(n)} = E\epsilon^{(n)}\epsilon^{(n)T}$.   (29)

All the information we wish about $E\tilde{\mathcal{E}}^{(n)}$, the average excess mean-square
error, is contained in (29). Thus, from (23)

$E\tilde{\mathcal{E}}^{(n)} = E\epsilon^{(n)T} A \epsilon^{(n)} = \sum_{i,j} (A)_{ij} (E\epsilon^{(n)}\epsilon^{(n)T})_{ji} = \mathrm{tr}\, A R^{(n)}$.   (30)

Similarly, the average norm $E\|\epsilon^{(n)}\|^2 = \mathrm{tr}\, R^{(n)}$.

Our procedure for writing an equation for the time evolution of $R^{(n)}$
is simply to write the definition of $R^{(n+1)}$ using (29), substitute (22) for
$\epsilon^{(n+1)}$, and do the average using the independence assumption. Various
cross terms arise, and the computations naturally fall into three steps:

Step 1:

$E[I - \alpha X^{(n)}X^{(n)T}]\,\epsilon^{(n)}\epsilon^{(n)T}\,[I - \alpha X^{(n)}X^{(n)T}]$

$= R^{(n)} - \alpha[A R^{(n)} + R^{(n)} A] + \alpha^2 E[X^{(n)}X^{(n)T} R^{(n)} X^{(n)}X^{(n)T}]$.   (31)

Appendix A discusses the evaluation of the last term. For simplicity,
we approximate the exact evaluation by $\alpha^2 A\, \mathrm{tr}\, A R^{(n)}$. When A = I, we
have $\mathrm{tr}[\alpha^2 A\, \mathrm{tr}\, A R^{(n)}] = \alpha^2 N E\tilde{\mathcal{E}}^{(n)}$, so in general this term plays the role
of the $\alpha^2 N$ term in (26).

Step 2:

$E\alpha[I - \alpha X^{(n)}X^{(n)T}]\,\epsilon^{(n)} X^{(n)T} (c^{*T}X^{(n)} - a_{n+K})$.   (32)

This is considered further in Appendix A and, for reasons given there,
is approximated by zero.

Step 3: As discussed in Appendix A,

$E\alpha^2 (c^{*T}X^{(n)} - a_{n+K}) X^{(n)}X^{(n)T} (c^{*T}X^{(n)} - a_{n+K}) \approx \alpha^2 \mathcal{E}^* A$.   (33)

Putting together these three steps, we have the following accurate
approximation from independence theory:

$R^{(n+1)} = R^{(n)} - \alpha[A R^{(n)} + R^{(n)} A] + \alpha^2 A\, \mathrm{tr}\, A R^{(n)} + \alpha^2 \mathcal{E}^* A$.   (34)

Note that the last term prevents $R^{(n)} = 0$ from being a solution. Thus,
$R^{(n)}$ is prevented from going to zero by the small forcing term. Thus,
in particular, $\epsilon^{(n)}$ only approaches zero but then executes small fluctuations
about zero.

Since (34) is an approximation, we prove in Appendix B that the
positive definite character of $R^{(n)}$ is preserved in (34).

We now introduce a more useful form of (34) when the mean-square
error is of primary interest. Since A is Hermitian, let U be the unitary
transformation which diagonalizes A,

$U^{\dagger} A U = D$,   (35)

where we call the elements of the diagonal matrix D by $d_i$. Further, let

$U^{\dagger} R^{(n)} U = T^{(n)}$.   (36)

In general, $T^{(n)}$ is not diagonal, but set $T_{ii}^{(n)} = t_i^{(n)}$. Further, note

$E\tilde{\mathcal{E}}^{(n)} = \mathrm{tr}\, A R^{(n)} = \mathrm{tr}\, D T^{(n)} = \sum_{i=1}^{N} d_i t_i^{(n)}$.   (37)

It follows from (34), (35), and (36) that

$T^{(n+1)} = T^{(n)} - \alpha[D T^{(n)} + T^{(n)} D] + \alpha^2 D\, \mathrm{tr}\, D T^{(n)} + \alpha^2 \mathcal{E}^* D$.   (38)

Noting from (37) that the mean-square error depends only on the
$t_i^{(n)}$, we are motivated to look at the diagonal terms of (38). Happily,
they decouple from the off-diagonal terms and we have

$t_i^{(n+1)} = t_i^{(n)} - 2\alpha d_i t_i^{(n)} + \alpha^2 d_i \sum_{j=1}^{N} d_j t_j^{(n)} + \alpha^2 \mathcal{E}^* d_i$.   (39)

If we introduce vectors $t^{(n)}$ and d in the obvious way, (39) itself can be
rewritten in matrix notation as

$t^{(n+1)} = M t^{(n)} + \alpha^2 \mathcal{E}^* d$,   (40)

where the N × N matrix M has elements

$M_{ij} = (1 - 2\alpha d_i)\delta_{ij} + \alpha^2 d_i d_j$.   (41)

From (41) we note M is real and symmetric and thus has real eigenvalues.

The solutions to (40) will be stable if and only if the matrix M has
all eigenvalues $\lambda_i$ such that $-1 < \lambda_i < 1$. Let g be an eigenvector of M
with eigenvalue $\lambda$. Then

$Mg = \lambda g$   (42)

reads

$g_i - 2\alpha d_i g_i + \alpha^2 \left(\sum_j d_j g_j\right) d_i = \lambda g_i$

or

$g_i = \frac{\alpha^2 d_i \sum_j d_j g_j}{\lambda - 1 + 2\alpha d_i}$,   (43)

$g_i$ denoting the components of g.

In (41) we see that, whenever $d_i = 0$, there is a $\lambda = 1$ for all $\alpha$. The
eigenvector has $g_i = 1$ and $g_j = 0$, $j \neq i$. These eigenvalues do not
change with $\alpha$ and are not of interest here. Set $\tilde{d}_i = d_i$ if $d_i \neq 0$. Then
we are concerned with

$\tilde{M}_{ij} = (1 - 2\alpha \tilde{d}_i)\delta_{ij} + \alpha^2 \tilde{d}_i \tilde{d}_j$   (44)

in a space of appropriately reduced dimension $\tilde{N}$. For $\alpha$ small enough,
the eigenvalues are approximately $1 - 2\alpha \tilde{d}_i < 1$ ($\alpha > 0$, of course). Now
increase $\alpha$ until possibly one of the eigenvalues becomes ±1. What is
the critical value of $\alpha$? Since all elements of (44) are strictly positive
(except at most $\tilde{N}$ values of $\alpha$), the magnitude of the largest eigenvalue
may be taken to be associated with a positive eigenvalue (Ref. 2). Thus, in
(43) [reinterpreted to match (44)], set $\lambda = 1$, multiply by $\tilde{d}_i$, and sum
on i. We then obtain

$\frac{\alpha}{2} \sum_i \tilde{d}_i = 1$.   (45)

Thus, independence theory predicts a stable algorithm if

$0 < \alpha < \frac{2}{N \bar{d}}$,   (46)

$\bar{d}$ being the average eigenvalue of the channel correlation matrix A.
The excess error $E\tilde{\mathcal{E}}^{(\infty)}$ after adaptation may be derived from (39)
using (37). We get

$E\tilde{\mathcal{E}}^{(\infty)} = \mathcal{E}^* \cdot \frac{\alpha \sum_i d_i}{2 - \alpha \sum_i d_i}$.   (47)

The above discussion should provide the reader with an idea of what 
we hope to justify and why. The independence assumption, if it leads 
to valid results, provides a very workable theory for gaining insights 
about, and doing calculations on, the convergence procedure. 
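The independence-theory recursion (39) is easy to iterate numerically, which gives a direct check of the stability bound (46) and of the steady-state excess error (47). The sketch below is illustrative only: the eigenvalues `d`, the minimum error `e_star`, and the two step-sizes are hypothetical values, and all function names are ours.

```python
def iterate_t(t, d, alpha, e_star):
    """One step of (39) for the diagonal terms t_i."""
    coupling = sum(dj * tj for dj, tj in zip(d, t))      # sum_j d_j t_j, see (37)
    return [ti - 2.0 * alpha * di * ti
            + alpha * alpha * di * coupling
            + alpha * alpha * e_star * di
            for ti, di in zip(t, d)]


def excess_error(t, d):
    """Excess mean-square error sum_i d_i t_i, as in (37)."""
    return sum(di * ti for di, ti in zip(d, t))


def stability_bound(d):
    """Upper limit 2 / sum_i d_i on alpha, i.e. 2 / (N * dbar) in (46)."""
    return 2.0 / sum(d)


def steady_state_excess(d, alpha, e_star):
    """Right-hand side of (47), assuming every d_i is nonzero."""
    s = alpha * sum(d)
    return e_star * s / (2.0 - s)


def run(d, alpha, e_star, t0, steps):
    t = list(t0)
    for _ in range(steps):
        t = iterate_t(t, d, alpha, e_star)
    return excess_error(t, d)


d = [1.0, 0.5, 0.25, 0.1]      # hypothetical eigenvalues, largest scaled to unity
e_star = 0.01                  # hypothetical minimum mean-square error
stable = run(d, 0.2, e_star, [1.0] * 4, 5000)     # alpha below the bound
unstable = run(d, 1.5, e_star, [1.0] * 4, 200)    # alpha above the bound
```

For these values the bound is 2/1.85; the run below it settles at the value given by (47), while the run above it grows without limit.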

IV. AN EXACT DESCRIPTION 

In this section, we put forth an exact description of how, in principle,
the average mean-square error may be obtained. We begin, however,
with the average error vector $E\epsilon^{(n)}$, a simpler quantity, but one which
requires essentially the same treatment. The exact dynamics of $\epsilon^{(n)}$ is
given in (22), and the independence theory for $E\epsilon^{(n)}$ is given by (24).

For simplicity, we rename the terms in (22)

$(I - \alpha X^{(n)}X^{(n)T})\epsilon^{(n)} = P_n \epsilon^{(n)}$   (48)

and

$-\alpha (c^{*T}X^{(n)} - a_{n+K}) X^{(n)} = f^{(n)}$   (49)

so (22) reads

$\epsilon^{(n+1)} = P_n \epsilon^{(n)} + f^{(n)}$,   (50)

which, by iteration starting with a fixed error vector $\epsilon^{(0)}$, has the
solution

$\epsilon^{(n+1)} = \prod_{i=0}^{n} P_i\, \epsilon^{(0)} + \sum_{s=0}^{n-1} \left(\prod_{i=s+1}^{n} P_i\right) f^{(s)} + f^{(n)}$.   (51)

Note in (51) the matrices $P_i$ do not commute so that a product $\prod_{i=1}^{n} P_i$
means in the order $P_n \cdots P_2 P_1$.

We proceed to examine (51) in more detail. We remark first that, by
their very definition, $P_n$ and $f^{(n)}$ depend on the data variables $\{a_n, a_{n+1},
\ldots, a_{n+M-1}\}$ [see (11), (15), (48), (49)], and thus $\epsilon^{(n+1)}$ depends on the
entire sequence $\{a_i\}_{i=0}^{n+M-1}$. If we formally average (51) making use of
the stationarity of the basic Bernoulli sequence $\{a_i\}$, we have

$E\epsilon^{(n+1)} = \left(E \prod_{i=0}^{n} P_i\right) \epsilon^{(0)} + \sum_{s=0}^{n} \left(E \prod_{i=1}^{s} P_i f^{(0)}\right)$,   (52)

the expectation being taken over all binary variables which enter (52),
namely, $a_0, a_1, \ldots, a_{n+M-1}$. The first term of (52) represents the decay
of the initial error to zero (the transient); the second term is the forced
response, causing a small but nonzero steady state error as $n \to \infty$.
We have not been able to work with (52) directly, and at this point
our analysis takes a crucial turn. We average (51) again, only this time
we do not average over all the binary variables which enter but only
over the sequence $a_0, a_1, \ldots, a_n$. Call this conditioned average $E_n$.
Then

$E_n \epsilon^{(n+1)} = \left(E_n \prod_{i=0}^{n} P_i\right) \epsilon^{(0)} + \sum_{s=0}^{n-1} \left(E_n \prod_{i=s+1}^{n} P_i f^{(s)}\right) + E_n f^{(n)}$.   (53)

Now, however, (53) is not one vector equation but $2^{M-1}$ of them, since
it is valid for any sequence of values of $\{a_{n+1}, \ldots, a_{n+M-1}\}$; these
variables appear in (53) for arbitrary values. Thus the set of values
just mentioned form a "super-index" which we may collectively call J,
J taking $2^{M-1}$ values. For example, we might choose to call (for M =
3) the values {+1, +1} to be J = 1, {+1, -1} to be J = 2, {-1, +1} to
be J = 3, and {-1, -1} to be J = 4. For the moment, however, the
precise mapping from the (M - 1) binary variables to the integer J is
unimportant.

We also want to consider the matrix

$P_n = I - \alpha X^{(n)}X^{(n)T}$   (54)

not as an N × N matrix, but as one consisting of $2^{M-1} \times 2^{M-1}$ blocks of
N × N matrices so that it may act in (53) as a transition matrix
between vector blocks.

Thus in (54) $P_n$ is determined by $X^{(n)}$, i.e., from (1), by

$X^{(n)} = B\, (a_n, a_{n+1}, \ldots, a_{n+M-1})^T$.   (55)

Hence the "super-index" J corresponding to the vector result of an
operation by $P_n$ would be the last M - 1 components of $a^{(n)}$, namely
$(a_{n+1}, a_{n+2}, \ldots, a_{n+M-1})$. On the other hand, $P_n$ acts on a quantity
determined by

$X^{(n-1)} = B\, (a_{n-1}, a_n, \ldots, a_{n+M-2})^T$,   (56)

that is, something with vector index $J' = (a_n, \ldots, a_{n+M-2})$. Thus if we
call

$I - \alpha X^{(n)}X^{(n)T} = K(J, J')$,   (57)

K(J, J') can only act between index pairs (J, J') which are "shift-compatible."
Thus if

$J \leftrightarrow (s_1, \ldots, s_{M-1})$

$J' \leftrightarrow (t_1, \ldots, t_{M-1})$,   (58)

where the $s_i$ and $t_i$ are binary variables, then

$K(J, J') = 0$ unless $s_i = t_{i+1}$,   i = 1, ..., M - 2.   (59)

On the other hand, if (J, J') are shift-compatible, this is enough to
determine the appropriate $X^{(n)}$. Thus with (58), (59),

$X^{(n)} = B\, (t_1, s_1, \ldots, s_{M-1})^T$   (60)

and we use (57) to define the appropriate K(J, J'). Having, in the
manner thus described, achieved the block structure (57), we define
the $N \times 2^{M-1}$ dimensional square matrix

$\mathcal{A}(\alpha) = \frac{1}{2} \begin{pmatrix} K(1,1) & K(1,2) & \cdots & K(1,2^{M-1}) \\ \vdots & & & \vdots \\ K(2^{M-1},1) & \cdots & & K(2^{M-1},2^{M-1}) \end{pmatrix}$.   (61)

There are, in fact, in any row of (61) only two nonvanishing blocks.
Summing over the row thus corresponds, because of the factor of 1/2 in
front, to averaging over the first component $a_n$ of $a^{(n)}$.

We write any N vector which is further labeled by our block index
J [v(J), say] as an $N \times 2^{M-1}$ vector V

$V = \begin{pmatrix} v(1) \\ v(2) \\ \vdots \\ v(2^{M-1}) \end{pmatrix}$.   (62)

To tie this all together, it is now easy to convince oneself that, if we
let $V_{n+1}$ correspond to $E_n \epsilon^{(n+1)}$ as in (62) and, similarly, let F correspond
to $E_n f^{(n)}$, then, by making use of the stationarity of the averages which
appear, (53) represents the solution of the equation

$V_{n+1} = \mathcal{A}(\alpha) V_n + F$   (63)

with initial condition

$V_0 = \begin{pmatrix} \epsilon^{(0)} \\ \epsilon^{(0)} \\ \vdots \\ \epsilon^{(0)} \end{pmatrix} \equiv [\epsilon^{(0)}]$.   (64)

In (64), the notation [v] has been introduced to represent an N-vector
"stacked" $2^{M-1}$ times.

The solution to (63) and (64) contains all the information we want.
In fact, once $V_n$ is known we, by definition, know $E_{n-1}\epsilon^{(n)}(J)$, where
we have modified the notation slightly to make explicit the dependence
on $J \leftrightarrow (a_{n+1}, \ldots, a_{n+M-1})$. To regain $E\epsilon^{(n)}$, we simply average:

$E\epsilon^{(n)} = E[E_{n-1}\epsilon^{(n)}(J)] = \frac{1}{2^{M-1}} \sum_{J=1}^{2^{M-1}} E_{n-1}\epsilon^{(n)}(J)$.   (65)
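The block recursion (63) with initial condition (64) and the final average (65) can be checked against a brute-force average over all data sequences in the smallest nontrivial case. The sketch below assumes a single tap (N = 1) on a channel with three nonzero samples, so that M = 3, K = 1, and each block of $V_n$ is a scalar; the channel values, step-size, and function names are hypothetical.

```python
from itertools import product

H1, H0, HM1 = 0.2, 1.0, -0.1       # hypothetical samples h(T), h(0), h(-T)
ALPHA = 0.1                        # hypothetical step-size
A = H1 ** 2 + H0 ** 2 + HM1 ** 2   # (6), a scalar when N = 1
C_STAR = H0 / A                    # (9), with v = E a_{n+1} X^(n) = h(0)


def p_and_f(a0, a1, a2):
    """P_n of (48) and f^(n) of (49) for data (a_n, a_{n+1}, a_{n+2}), K = 1."""
    X = H1 * a0 + H0 * a1 + HM1 * a2                     # (11) with N = 1
    return 1.0 - ALPHA * X * X, -ALPHA * (C_STAR * X - a1) * X


def exact_mean_error(eps0, n):
    """Iterate (63) n times from (64), then average over the super-index as in (65)."""
    V = {J: eps0 for J in product((1, -1), repeat=2)}    # V_0, keyed by (a_0, a_1)
    for _ in range(n):
        new = {}
        for s1, s2 in product((1, -1), repeat=2):        # J  = (a_{n+1}, a_{n+2})
            acc = 0.0
            for t1 in (1, -1):                           # J' = (a_n, a_{n+1}), a_n averaged
                P, f = p_and_f(t1, s1, s2)
                acc += 0.5 * (P * V[(t1, s1)] + f)
            new[(s1, s2)] = acc
        V = new
    return sum(V.values()) / len(V)


def brute_force_mean_error(eps0, n):
    """Average the exact dynamics (50) over every binary sequence a_0 .. a_{n+1}."""
    sequences = list(product((1, -1), repeat=n + 2))
    total = 0.0
    for a in sequences:
        eps = eps0
        for k in range(n):
            P, f = p_and_f(a[k], a[k + 1], a[k + 2])
            eps = P * eps + f
        total += eps
    return total / len(sequences)


exact = exact_mean_error(0.5, 6)
brute = brute_force_mean_error(0.5, 6)
independence = (1.0 - ALPHA * A) ** 6 * 0.5   # prediction of (24), for comparison
```

The two exact computations agree to rounding error, while the cost of the recursion grows only linearly in n; `independence` is included so the prediction of (24) can be compared with them.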



The average in (65) can be put in another form if we introduce the
matrix†

$P_1 = \frac{1}{2^{M-1}} \begin{pmatrix} I & I & I & \cdots \\ I & I & I & \cdots \\ \vdots & & & \\ I & I & I & \cdots \end{pmatrix}$   (66)

having each N × N block equal to the identity matrix. Then

$[E\epsilon^{(n)}] = P_1 V_n$.   (67)

We may already note that $P_1$ is an orthogonal projection operator ($P_1^2
= P_1$, $P_1^T = P_1$) and (67) thus states that $[E\epsilon^{(n)}]$ is a projection of $V_n$ into
an appropriate subspace. Further, note that

$[E f^{(n)}] = [0] = P_1 F$   (68)

and thus F belongs to the orthogonal subspace.
The formal solution of (63) (including the final projection) is

$P_1 V_n = P_1 \mathcal{A}^n(\alpha) [\epsilon^{(0)}] + P_1 \sum_{k=0}^{n-1} \mathcal{A}^k(\alpha) F$   (69)

having the limit

$P_1 V_\infty = P_1 [I - \mathcal{A}(\alpha)]^{-1} F$.   (70)

† We hope a warning that the symbol $P_1$ is being used for different things in (66) and
(48) will eliminate confusion.
Both (69) and (70) can be computed using the spectral decomposition
for functions of a matrix $\mathcal{A}$. If $\mathcal{A}$ has all its eigenvalues $\lambda_i$ of index one,
that is, if its eigenvectors $U_i$ span the space (all Jordan blocks one-dimensional)
and if $W_i$ are corresponding eigenvectors of $\mathcal{A}^T$, chosen
so that

$W_i^T U_j = \delta_{ij}$,   (71)

then for (almost) any function $h(\cdot)$,

$h(\mathcal{A}) = \sum_i h(\lambda_i) U_i W_i^T$.   (72)

Roughly, $h(\cdot)$ is restricted so that $h(\lambda_i)$ is defined. A similar but more
complicated theorem holds if the $U_i$ do not span the space. If $\alpha \neq 0$, it
may be reasonable to assume that the $U_i$ do indeed span the space,
but for $\alpha = 0$ they do not.

We may already note that asymptotic stability of the full-fledged
algorithm is guaranteed if all eigenvalues of $\mathcal{A}(\alpha)$ are less than unity in
magnitude. In fact, only those eigenvalues which are associated with
a $U_i$ such that $P_1 U_i \neq 0$ need have magnitude less than unity.

In general, because of the very large dimension ($N 2^{M-1}$) encountered
in practical use, the above theory would be more useful if workable
approximations could be found. We present one such approach in
Section VI which is based on a perturbation approach for small step-size
$\alpha$. Before doing that, we retreat a bit to demonstrate how the
mean-square error may be brought into essentially the same form just
developed for the average error vector.

We again find it more convenient to discuss the error matrix $R^{(n)}$
defined in (29). We substitute (50) directly into (29) and perform our
trick of taking the average $E_n$ (which involves averaging only over $a_0,
a_1, \ldots, a_n$ leaving $a_{n+1}, \ldots, a_{n+M-1}$ fixed) to obtain

$E_n R^{(n+1)} = \frac{1}{2} \sum_{a_n} P_n (E_{n-1}R^{(n)}) P_n + \frac{1}{2} \sum_{a_n} f^{(n)} f^{(n)T}$

$\quad + \frac{1}{2} \sum_{a_n} \{P_n (E_{n-1}\epsilon^{(n)}) f^{(n)T} + f^{(n)} (E_{n-1}\epsilon^{(n)T}) P_n\}$.   (73)

In (73), $\sum_{a_n}$ refers to summing over $a_n = \pm 1$. Note that in (73) the
sequence of quantities $E_{n-1}\epsilon^{(n)}$ may be regarded as known (or calculable)
since they are the N dimensional subvectors which make up the
$N \times 2^{M-1}$ dimensional solution $V_n$ to (63) and (64).

We will rewrite (73), but first we need some notation. If R is any N
× N matrix, we may make an $N^2$ dimensional vector out of it by
writing the quantity

$\xi(R) = (R_{11}, R_{12}, \ldots, R_{1N}, R_{21}, R_{22}, \ldots, R_{2N}, \ldots, R_{N1}, \ldots, R_{NN})^T$.   (74)

We call $\xi(R)$ the vector made out of R.

In this trivial sense, we use $\xi(\cdot)$ as an operator. We use this to turn
some of the terms in (73) into vectors. Introducing the "J-index" for
emphasis (it is, of course, implicit when we use $E_n$) we define

$w^{(n+1)}(J) = \xi[E_n R^{(n+1)}(J)]$   (75a)

$g(J) = \xi\{E_n f^{(n)} f^{(n)T}\}$   (75b)

$g(V_n, J) = \xi[E_n f^{(n)} f^{(n)T} + E_n P_n \epsilon^{(n)} f^{(n)T} + E_n f^{(n)} \epsilon^{(n)T} P_n]$.   (75c)

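Next we note that if A, R, and B are N × N matrices, then

$\xi(ARB) = C\,\xi(R)$,   (76)

where C is an $N^2 \times N^2$ matrix. In fact, C is the direct product $A \otimes B^T$
where $A \otimes B$ (not $A \otimes B^T$) is given by

$A \otimes B = \begin{pmatrix} a_{11}B & a_{12}B & \cdots & a_{1N}B \\ a_{21}B & a_{22}B & \cdots & a_{2N}B \\ \vdots & & & \vdots \\ a_{N1}B & a_{N2}B & \cdots & a_{NN}B \end{pmatrix}$.   (77)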
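The identity (76) can be verified numerically for small matrices. The sketch below uses row-by-row stacking as in (74) and the direct product as in (77); the helper names and the integer test matrices are illustrative assumptions.

```python
def matmul(A, B):
    """Ordinary matrix product of two lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def transpose(A):
    return [list(col) for col in zip(*A)]


def kron(A, B):
    """Direct product of (77): the block in position (i, k) is a_ik * B."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]


def stack_rows(R):
    """Vector made out of R by stacking its rows, as in (74)."""
    return [entry for row in R for entry in row]


def matvec(C, x):
    return [sum(c * v for c, v in zip(row, x)) for row in C]


A = [[1, 2], [3, 4]]
R = [[0, 1], [5, -2]]
B = [[2, -1], [1, 3]]
lhs = stack_rows(matmul(matmul(A, R), B))            # left side of (76)
rhs = matvec(kron(A, transpose(B)), stack_rows(R))   # C = A (x) B^T applied to the stacked R
```

Using `kron(A, B)` instead of `kron(A, transpose(B))` breaks the equality for a non-symmetric B, which is the point of the parenthetical remark before (77).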
In the notation (75) and (77), (73) may be rewritten as

$w^{(n+1)}(J) = \frac{1}{2} \sum_{a_n} P_n \otimes P_n\, w^{(n)}(J') + g(V_n, J)$.   (78)

In (78) J' is the index that is shift-compatible with J for the given $a_n$. As
in (63), we form $N^2 \times 2^{M-1}$ dimensional vectors $W_n$, G, and $G(V_n)$
from $w^{(n)}(J)$, g(J), and $g(V_n, J)$, respectively. And finally, using the
definition of K(J, J') in (57) to (60) we write

$\mathcal{B}(\alpha) = \frac{1}{2} \begin{pmatrix} K(1,1) \otimes K(1,1) & K(1,2) \otimes K(1,2) & \cdots \\ \vdots & & \\ K(2^{M-1},1) \otimes K(2^{M-1},1) & \cdots & \end{pmatrix}$.   (79)

The collection of equations (78) reads

$W_{n+1} = \mathcal{B}(\alpha) W_n + G(V_n)$.   (80)

Equation (80) with (63), (64), and the initial condition

$W_0 = \begin{pmatrix} \xi(\epsilon^{(0)}\epsilon^{(0)T}) \\ \xi(\epsilon^{(0)}\epsilon^{(0)T}) \\ \vdots \\ \xi(\epsilon^{(0)}\epsilon^{(0)T}) \end{pmatrix}$   (81a)

provide an exact description of the error matrix.

To simplify matters, we replace (80) by the approximate version

$W_{n+1} = \mathcal{B}(\alpha) W_n + G$,   (81b)

where G, as already defined, is formed from (75b) as $G(V_n)$ was formed
from (75c). When more is understood about the solutions of our
equations, we see that the replacement of (80) by (81b) is not a serious
matter.†

Again, we are not interested so much in $W_n$ as the projected version

$[\xi(R^{(n)})] = P_1 W_n$.   (82)

In (82) the bracket notation is the same as (64) except that $\xi(R^{(n)})$ is a
vector of dimension $N^2$ instead of N. Also, in (82) $P_1$ has the same
meaning as in (66) except that the identity matrices are all in $N^2$
dimensions instead of N.

† In most situations, $G(V_n)$ is small compared to the initial error and the associated
transient. The main effect of the forcing term is to give a nonzero error as $n \to \infty$. But
$V_n \to 0$, and $G(V_n)$ reduces to G.
In summary, we thus see that both eq. (63) for $V_n$, which represents
$E_{n-1}\epsilon^{(n)}$, and (81b) for $W_n$, which represents $E_{n-1}R^{(n)}$, have the form

$\Gamma_{n+1} = \mathcal{A}(\alpha)\Gamma_n + \mathcal{F}$,   (83)

with

$\Gamma_0 = [w]$,   (84)

and the quantity of interest being

$\mathcal{P}_1 \Gamma_n$   (85)

for the appropriate dimension and projection $\mathcal{P}_1$.

V. THE CASE o = 

The equalization problem is uninteresting when the step-size is 
taken to be zero, i.e., nothing happens. However, since we soon intend 
to do a perturbation analysis about a = we must be familiar with our 
formalism when a = 0. This is not trivial, and we devote this section 
to it. 

To display matrices explicitly, we need a labeling procedure. We let 
the "super-index" J run from to (2 M_1 - l).f The J value which 
labels (ai, • • • , aM-i)(ai = ±1) is gotten as follows: Change +1 to 0, 
and —1 to 1, obtaining then binary representation of J. Thus, for M 
= 3, J = 0, 1, 2, 3 correspond respectively to (+, +), (+-), (-+), and 
(— , — ). With this labeling we have 







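$$\mathcal{A}(0) = \frac{1}{2}\begin{bmatrix}
I & 0 & 0 & \cdots & 0 & I & 0 & 0 & \cdots & 0\\
I & 0 & 0 & \cdots & 0 & I & 0 & 0 & \cdots & 0\\
0 & I & 0 & \cdots & 0 & 0 & I & 0 & \cdots & 0\\
0 & I & 0 & \cdots & 0 & 0 & I & 0 & \cdots & 0\\
0 & 0 & I & \cdots & 0 & 0 & 0 & I & \cdots & 0\\
0 & 0 & I & \cdots & 0 & 0 & 0 & I & \cdots & 0\\
\vdots & & & & & \vdots & & & & \\
0 & 0 & 0 & \cdots & I & 0 & 0 & 0 & \cdots & I\\
0 & 0 & 0 & \cdots & I & 0 & 0 & 0 & \cdots & I
\end{bmatrix} = \Gamma \otimes I. \tag{86}$$

Let $S$ be the vector space of dimension $N$ or $N^2$ according as $\Gamma_n$ in (83) refers to $V_n$ or $W_n$. Then in (86) $I$ refers to the identity in $S$.

The matrix $\mathcal{A}(\alpha)$ has the same structure as (86), with each identity being replaced by the appropriate $I - \alpha XX^T$ or $(I - \alpha XX^T) \otimes (I - \alpha XX^T)$.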
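The labeling rule and the block pattern of (86) can be made concrete with a short sketch. The function names below (`super_index`, `gamma_matrix`, `a_zero`) are ours and purely illustrative, and the row pattern (entries $\tfrac12$ in columns $\lfloor i/2 \rfloor$ and $\lfloor i/2 \rfloor + 2^{M-2}$, rows and columns counted from zero) is our reading of the display (86).

```python
import numpy as np

def super_index(signs):
    """Map (a_1, ..., a_{M-1}), each +1 or -1, to J: +1 -> bit 0, -1 -> bit 1,
    with a_1 as the most significant bit."""
    J = 0
    for s in signs:
        J = 2 * J + (0 if s == +1 else 1)
    return J

def gamma_matrix(M):
    """The 2^(M-1) x 2^(M-1) matrix Gamma as read off from (86); assumes M >= 2."""
    K = 2 ** (M - 1)
    G = np.zeros((K, K))
    for i in range(K):
        G[i, i // 2] = 0.5             # entry in the left half of row i
        G[i, i // 2 + K // 2] = 0.5    # same pattern repeated in the right half
    return G

def a_zero(M, n):
    """A(0) = Gamma (x) I, where I is the n x n identity of S (n = N or N^2)."""
    return np.kron(gamma_matrix(M), np.eye(n))
```

For $M = 3$ this gives the four labels $0, 1, 2, 3$ for $(+,+)$, $(+,-)$, $(-,+)$, $(-,-)$, and a $4 \times 4$ matrix $\Gamma$ whose rows each sum to one.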



† This labeling is for descriptive convenience here. We hope the reader is forgiving if we later let $J = 1, 2, \ldots, 2^{M-1}$. We will be explicit about the convention when it matters.



980 THE BELL SYSTEM TECHNICAL JOURNAL, MAY-JUNE 1979 



Table I. Zero-eigenvalue structure of $\Gamma$

Index                  Number of blocks of this index
$M - 1$                1
$M - 2$                1
$M - 3$                2
$M - 4$                4
$M - 5$                8
...                    ...
$M - l$                $2^{l-2}$  $(l \ge 3)$
...                    ...
$M - (M - 2) = 2$      $2^{M-4}$
1                      $2^{M-3}$



The matrix $\Gamma$ in (86) is basic to our study and we now concentrate on it; it has dimension $2^{M-1}$. Clearly, the all-ones vector is an eigenvector of $\Gamma$ having eigenvalue one. The reader may convince himself that $\Gamma^{M-1}$ is proportional to the matrix consisting of all ones, which has $(2^{M-1} - 1)$ eigenvectors perpendicular to the all-ones vector. These eigenvectors are associated with eigenvalue zero. Using the fact that the eigenvalues of a power of a matrix are the powers of the eigenvalues, we conclude that $\Gamma$ has one unity eigenvalue and $(2^{M-1} - 1)$ zero ones. The zero eigenvalues are not of index one, however (index, recall, is the dimension of the Jordan block). Table I summarizes the structure of the zero eigenvalues of $\Gamma$.
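The entries of Table I can be checked numerically for small $M$. The sketch below is an illustration under our reading of (86), not part of the original derivation: it counts Jordan blocks of the zero eigenvalue from the ranks of successive powers of $\Gamma$, using the standard fact that the number of blocks of size at least $s$ equals $\operatorname{rank}\Gamma^{s-1} - \operatorname{rank}\Gamma^{s}$. The helper names are ours.

```python
import numpy as np

def gamma_matrix(M):
    """Gamma as read off from (86); assumes M >= 2."""
    K = 2 ** (M - 1)
    G = np.zeros((K, K))
    for i in range(K):
        G[i, i // 2] = 0.5
        G[i, i // 2 + K // 2] = 0.5
    return G

def zero_block_counts(M):
    """Return {index: number of Jordan blocks of that index} for eigenvalue zero."""
    G = gamma_matrix(M)
    K = G.shape[0]
    ranks = []                      # ranks[s] = rank of Gamma^s, s = 0, ..., M
    P = np.eye(K)
    for _ in range(M + 1):
        ranks.append(int(np.linalg.matrix_rank(P)))
        P = P @ G
    # at_least[s-1] = number of zero-eigenvalue blocks of size >= s
    at_least = [ranks[s - 1] - ranks[s] for s in range(1, M + 1)]
    counts = {}
    for s in range(1, M):
        exact = at_least[s - 1] - at_least[s]
        if exact:
            counts[s] = exact
    return counts
```

For $M = 4$ the result is one block of index 3, one of index 2, and two of index 1, in agreement with the rows $M-1$, $M-2$, $M-3$ of Table I.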

While it is not crucial for the sequel, we also give the eigenvectors and generalized eigenvectors of $\Gamma$. These are the columns, albeit permuted, of Hadamard matrices $H_n$ constructed according to $H_1 = 1$,

$$H_{2n} = \begin{bmatrix} H_n & H_n\\ H_n & -H_n \end{bmatrix}. \tag{87}$$

Rows and columns of $H_n$ are labeled from 0 to $n - 1$. Our claim is that the columns of $H_{2^{M-1}}$ are the (unnormalized) generalized eigenvectors of $\Gamma$. Recall that a sequence of vectors $x_l$, $l = 1, \ldots, k$, forms a chain of generalized eigenvectors corresponding to a $k$-dimensional Jordan block when

$$\Gamma x_l = x_{l+1}, \qquad l = 1, \ldots, k - 1,$$

$$\Gamma x_k = \lambda x_k.$$

Clearly, the last $2^{M-2}$ columns of $H$ satisfy $\Gamma x_k = 0$, and these are the only ones. If $c_k$ is the $k$th column, $2^{M-2} \le k \le 2^{M-1} - 1$, then the chain that ends with it is, in reverse order,†



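$$(c_k,\ c_{k/2},\ c_{k/4},\ \ldots), \tag{88}$$

the chain stopping as soon as the index would cease to be an integer. These notions may be verified for

$$H_8 = \begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1\\
1 & -1 & 1 & -1 & 1 & -1 & 1 & -1\\
1 & 1 & -1 & -1 & 1 & 1 & -1 & -1\\
1 & -1 & -1 & 1 & 1 & -1 & -1 & 1\\
1 & 1 & 1 & 1 & -1 & -1 & -1 & -1\\
1 & -1 & 1 & -1 & -1 & 1 & -1 & 1\\
1 & 1 & -1 & -1 & -1 & -1 & 1 & 1\\
1 & -1 & -1 & 1 & -1 & 1 & 1 & -1
\end{bmatrix}. \tag{89}$$

† For (88) to hold, it is essential that the first column be labeled $c_0$. Also, of course, the $c_k$ of this section is different from the $c_k$ in Section II, where it signified equalizer taps. No confusion should arise.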



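The chains are (4, 2, 1), (5), (6, 3), (7). If we rearrange the columns of $H_8$ to give

$$H_8' = (c_0, c_5, c_7, c_6, c_3, c_4, c_2, c_1), \tag{90}$$

then

$$\frac{1}{8}\,H_8'^{\,T}\,\Gamma\,H_8' = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix}. \tag{91}$$

From the direct product structure in (86) we conclude that if $\Phi_i$ are a complete o.n. set for $S$, then the generalized eigenvectors of $\mathcal{A}(0)$ are

$$\frac{1}{\sqrt{2^{M-1}}}\; c_k \otimes \Phi_i, \tag{92}$$

$c_k$ being the columns of the Hadamard matrix just described. In particular, $\mathcal{A}(0)$ has $N$ (or $N^2$) unity eigenvalues of index one, having eigenvectors

$$U_i = \frac{1}{\sqrt{2^{M-1}}}\begin{bmatrix}\Phi_i\\ \Phi_i\\ \vdots\\ \Phi_i\end{bmatrix}; \tag{93}$$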
the remaining eigenvalues are zero. The projection operator onto the space spanned by the eigenvectors having $\lambda = 1$ is, using (93),

$$\sum_i U_i U_i^T = \mathcal{P}_1, \tag{94}$$

where $\mathcal{P}_1$ has already been introduced in (66), $I$ being the identity of $S$. Since $\mathcal{P}_1 = \mathcal{P}_1^T$, the projection is orthogonal. We denote the projection onto the "zero eigenvalue subspace" of $\mathcal{A}(0)$ by $\mathcal{P}_0$, and $\mathcal{P}_0 = I - \mathcal{P}_1$.
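Returning briefly to the chains (88) and the Jordan form (91), both can be verified directly. The following is a minimal sketch under our reading of (86) and (87); the helper names are ours, and the column order used for $H_8'$ is our reading of (90).

```python
import numpy as np

def gamma_matrix(M):
    """Gamma as read off from (86); assumes M >= 2."""
    K = 2 ** (M - 1)
    G = np.zeros((K, K))
    for i in range(K):
        G[i, i // 2] = 0.5
        G[i, i // 2 + K // 2] = 0.5
    return G

def hadamard(n):
    """Construction (87): H_1 = 1, H_2n = [[H_n, H_n], [H_n, -H_n]]; n a power of 2."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def chain_rule_holds(M):
    """Check Gamma c_0 = c_0 and Gamma c_k = c_{2k} (or 0 once 2k leaves the range)."""
    G = gamma_matrix(M)
    K = G.shape[0]
    H = hadamard(K)
    ok = np.allclose(G @ H[:, 0], H[:, 0])
    for k in range(1, K):
        target = H[:, 2 * k] if 2 * k < K else np.zeros(K)
        ok = ok and np.allclose(G @ H[:, k], target)
    return bool(ok)

def jordan_form_h8():
    """(1/8) H8'^T Gamma H8' with the column order (c0, c5, c7, c6, c3, c4, c2, c1)."""
    G = gamma_matrix(4)
    Hp = hadamard(8)[:, [0, 5, 7, 6, 3, 4, 2, 1]]
    return Hp.T @ G @ Hp / 8
```

The relation $\Gamma c_k = c_{2k}$ is just (88) read forward: $c_1 \to c_2 \to c_4 \to 0$ and $c_3 \to c_6 \to 0$ for $H_8$.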

Thus, when we solve (83), we really desire, according to (85), not $\Gamma_n$ but $\mathcal{P}_1\Gamma_n$, its projection onto the unity eigenvalue subspace of $\mathcal{A}(0)$.

A standard spectral representation of $\mathcal{A}(0)$ is

$$\mathcal{A}(0) = \mathcal{P}_1 + \mathcal{D}_0, \tag{95}$$

where $\mathcal{D}_0^{M-1} = 0$. This defines (for us) $\mathcal{D}_0$. It may be shown that $\mathcal{P}_1\mathcal{D}_0 = 0$.
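The properties claimed for (95) are easy to confirm numerically for a small case. The sketch below assumes our reading of (86) and uses $\mathcal{P}_1 = (\mathbf{1}\mathbf{1}^T / 2^{M-1}) \otimes I$, which is what (93) and (94) give when the $\Phi_i$ form an orthonormal basis; the variable names and the choice $M = 4$, $n = 2$ are ours.

```python
import numpy as np

def gamma_matrix(M):
    """Gamma as read off from (86); assumes M >= 2."""
    K = 2 ** (M - 1)
    G = np.zeros((K, K))
    for i in range(K):
        G[i, i // 2] = 0.5
        G[i, i // 2 + K // 2] = 0.5
    return G

M, n = 4, 2                       # illustrative sizes; n plays the role of N (or N^2)
K = 2 ** (M - 1)
A0 = np.kron(gamma_matrix(M), np.eye(n))          # A(0) = Gamma (x) I
P1 = np.kron(np.ones((K, K)) / K, np.eye(n))      # sum_i U_i U_i^T from (93), (94)
D0 = A0 - P1                                      # nilpotent part defined by (95)

D0_power = np.linalg.matrix_power(D0, M - 1)      # should vanish
P1_D0 = P1 @ D0                                   # should vanish
```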

We remark here that our basic equalization problem is unchanged if any infinite sample sequence of data values $\{a_n\}$ is replaced by their negatives. This follows from the quadratic nature (in the $a_n$) of the algorithm (19). As a consequence, we have

$$E_{n-1}\epsilon^{(n)}(J \leftrightarrow s_1, \ldots, s_{M-1}) = E_{n-1}\epsilon^{(n)}(J \leftrightarrow -s_1, \ldots, -s_{M-1}) \tag{96}$$

and similarly for $w^{(n)}(J)$. We have not exploited this symmetry, but if we had, the dimension of $\mathcal{A}(\alpha)$ could be reduced by a factor of 2. $\mathcal{A}(0)$ would then, in particular, have a different form, but would have many of the same properties discussed here.

Finally, we take this opportunity to get some notational problems out of the way. We introduce a convenient way of labeling matrices $C$ with block structure as in (86). Label rows by $\mu$, $\mu = 1, 2, \ldots, n\,2^{M-1}$, and likewise columns by $\nu$. If we write

$$\begin{aligned}
\mu &= (i - 1)n + k, \qquad 1 \le i, j \le 2^{M-1},\\
\nu &= (j - 1)n + l, \qquad 1 \le k, l \le N \ (\text{or } N^2),\\
n &= N \ (\text{or } N^2),
\end{aligned} \tag{97}$$

then the pair $(i, j)$ specifies which block we are concerned with, while the pair $(k, l)$ are the usual matrix indices for the $N \times N$ (or $N^2 \times N^2$) matrix in that block. Thus, for example, in (77),

$$(A \otimes B)_{\mu\nu} = a_{ij}\,b_{kl}. \tag{98}$$

Likewise, in (92) the vector $c \otimes \Phi$ has components

$$(c \otimes \Phi)_\mu = c_i\,\Phi_k. \tag{99}$$

The orthonormal basis for $S$ where the $k$th basis vector has a one in the $k$th position and zeros elsewhere is denoted by $\{e_k\}$.
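The index convention (97) to (99) coincides with the ordering used by NumPy's `kron`, which the following sketch illustrates with zero-based indices (so that $\mu = i\,n + k$). The helper `block_index` and the random test matrices are ours.

```python
import numpy as np

def block_index(i, k, n):
    """Zero-based analogue of (97): block number i, position k within a block of size n."""
    return i * n + k

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))    # stands in for a 2^(M-1) x 2^(M-1) matrix
B = rng.standard_normal((3, 3))    # stands in for an N x N matrix
c = rng.standard_normal(4)
phi = rng.standard_normal(3)
n = B.shape[0]

AB = np.kron(A, B)                 # (A (x) B)_{mu nu} = a_ij * b_kl, as in (98)
c_phi = np.kron(c, phi)            # (c (x) Phi)_mu = c_i * Phi_k, as in (99)

def matrix_entry_matches(i, j, k, l):
    return bool(np.isclose(AB[block_index(i, k, n), block_index(j, l, n)], A[i, j] * B[k, l]))

def vector_entry_matches(i, k):
    return bool(np.isclose(c_phi[block_index(i, k, n)], c[i] * phi[k]))
```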
VI. THE PERTURBATION THEORY

We begin the next stage of analysis by writing our matrices in a new basis. Consider the orthogonal transformation matrix $U$ which brings $\mathcal{A}(0)$ to Jordan form, namely, the matrix $U$ whose columns are of the form

$$\frac{1}{\sqrt{2^{M-1}}}\; c_i \otimes e_k, \tag{100}$$

where $c_i$ are columns of the Hadamard matrix of appropriate dimension, and $e_k$ are the basis vectors of $S$. In (100), $i$ and $k$ range over all possible values. The columns of $U$ are assumed to be arranged so that the result on $\mathcal{A}(0)$ comes out "nice." We will not bother to be too explicit, except to say that the first $N$ (or $N^2$) columns of $U$ are

$$\frac{1}{\sqrt{2^{M-1}}}\; c_0 \otimes e_k, \qquad k = 1, \ldots, N\ (N^2). \tag{101}$$

Then†

$$U^T \mathcal{A}(\alpha)\, U = \begin{bmatrix} \beta & \nu\\ \gamma & \delta \end{bmatrix} = \tilde{\mathcal{A}}(\alpha). \tag{102}$$



In (102), $\beta$ is an $N \times N$ matrix, $\nu$ is an $N \times (2^{M-1}N - N)$ matrix, etc. If $\alpha = 0$, (102) takes the form

$$\begin{bmatrix} I & 0\\ 0 & J \end{bmatrix}, \tag{103}$$

$J$ being a Jordan block exemplified by (91), i.e., "nice." Note that $J^l = 0$ if $l \ge M - 1$.

In general, when $\alpha \ne 0$, all blocks in (103) have added terms which are linear in $\alpha$, or linear and quadratic, depending on whether (61) or (79) applies.‡

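We shall be especially concerned with the matrix $\beta$, for it is here that the germ of independence theory appears. To calculate it, we want

$$\beta_{kl} = \left[\frac{1}{\sqrt{2^{M-1}}}\, c_0 \otimes e_k\right]^T \mathcal{A}(\alpha) \left[\frac{1}{\sqrt{2^{M-1}}}\, c_0 \otimes e_l\right]. \tag{104}$$

Calling the $(m, n)$ element of the $(i, j)$ block of $\mathcal{A}(\alpha)$ by $\mathcal{A}^{ij}_{mn}$, (104) becomes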
† Henceforth, we denote transformed quantities by a tilde.

‡ The reader should note that the simple equations (27) and (28) suggest that the linear and quadratic $\alpha$-terms are of equal importance for ranges of $\alpha$ of interest.
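$$\beta_{kl} = \frac{1}{2^{M-1}} \sum_{i,j,m,n} (c_0)_i\,(e_k)_m\,\mathcal{A}^{ij}_{mn}\,(c_0)_j\,(e_l)_n. \tag{105}$$

Now $\mathcal{A}^{ij}_{mn} = 0$ whenever $\Gamma_{ij}$ in (86) is. Thus, for fixed $m, n$ there are only $2^M$ possible $\mathcal{A}^{ij}_{mn}$ which are nonzero. Denote the sum over these as $\sum_{\text{nonzero}}$. Then (105) becomes, using $(e_k)_m = \delta_{km}$, $(c_0)_i = 1$,

$$\beta_{kl} = \frac{1}{2^{M-1}} \sum_{\text{nonzero}} \mathcal{A}^{ij}_{kl}. \tag{106}$$

Equation (106) gives $\beta_{kl}$ as the average of the $(k, l)$ elements of all $2^M$ blocks in $\mathcal{A}(\alpha)$ which are not a priori zero (the factor $\tfrac{1}{2}$ carried by each block, as in (86), turns the prefactor $1/2^{M-1}$ into an average over $2^M$ terms). This, however, is nothing but

$$E(I - \alpha XX^T) = I - \alpha A, \tag{107}$$

precisely the matrix which enters in the independence theory! Likewise, if $\mathcal{A}(\alpha) = \mathcal{B}(\alpha)$,

$$E(I - \alpha XX^T) \otimes (I - \alpha XX^T) \tag{108}$$

is the matrix by which we would solve independence theory had we rewritten (31) giving $R^{(n)}$ its vector form rather than its matrix form.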
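The averaging argument behind (106) and (107) can be illustrated on a toy model. The sketch below rests on several assumptions of ours that are not spelled out in this section: that $X = Ba$ with $a$ a vector of $M$ independent $\pm 1$ symbols (our reading of (11)), that each of the $2^M$ nonzero blocks of $\mathcal{A}(\alpha)$ is $\tfrac12(I - \alpha XX^T)$ for a distinct $a$, and that the block positions follow the pattern of (86). The function names and the bit-to-symbol assignment are hypothetical.

```python
import numpy as np

def build_A(alpha, B):
    """Toy A(alpha): one block (1/2)(I - alpha X X^T) per M-symbol word, X = B a."""
    N, M = B.shape
    K = 2 ** (M - 1)
    A = np.zeros((K * N, K * N))
    for w in range(2 ** M):
        # symbols from the bits of w: bit 0 -> +1, bit 1 -> -1 (assumed convention)
        a = np.array([1.0 if (w >> t) & 1 == 0 else -1.0 for t in range(M)])
        X = B @ a
        i, j = w % K, w >> 1        # block row and column, matching the pattern of (86)
        A[i * N:(i + 1) * N, j * N:(j + 1) * N] = 0.5 * (np.eye(N) - alpha * np.outer(X, X))
    return A

def beta_block(alpha, B):
    """beta from (104): project A(alpha) on the columns c_0 (x) e_k / sqrt(2^(M-1))."""
    N, M = B.shape
    K = 2 ** (M - 1)
    E = np.kron(np.ones((K, 1)), np.eye(N)) / np.sqrt(K)
    return E.T @ build_A(alpha, B) @ E
```

Under these assumptions `beta_block(alpha, B)` reproduces $I - \alpha BB^T$, that is, $I - \alpha A$ with $A = E\,XX^T = BB^T$, for any $B$.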
What do vectors look like with our new o.n. basis? If $V$ is a column vector of numbers in the original basis, then in the new basis the numbers are $U^T V$. Let $V$ be considered as blocks of $N$ ($N^2$)-vectors $\Phi^i$; the $k$th component of each is $\Phi^i_k$. Then the inner product of a particular row of $U^T$ with $V$, namely,

$$\frac{1}{\sqrt{2^{M-1}}} \sum_{\mu} (c_j \otimes e_k)_\mu\, V_\mu,$$

is a generic term of $U^T V$ which evaluates to

$$\frac{1}{\sqrt{2^{M-1}}} \sum_i (c_j)_i\, \Phi^i_k. \tag{109}$$

Thus the first $N$ ($N^2$) components (the first block) are simply $\sqrt{2^{M-1}}$ times the average of the blocks of $V$. In other words, for a vector whose blocks are all equal to $\Phi$,

$$U^T \begin{bmatrix}\Phi\\ \Phi\\ \vdots\\ \Phi\end{bmatrix} = \sqrt{2^{M-1}} \begin{bmatrix}\Phi\\ 0\end{bmatrix}. \tag{110}$$

The right member of (110) is, of course, written in a notation compatible with (102). Likewise, a vector with zero average transforms to a vector which may be written

$$\sqrt{2^{M-1}} \begin{bmatrix} 0\\ \Psi \end{bmatrix}. \tag{111}$$

Thus the initial condition for (63) or (80) is of type (110), unlike the driving term for (63), which is of type (111). The driving term for (80) has both types.

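Finally, we note that the projection operator onto the unity eigenspace of $\tilde{\mathcal{A}}(0)$ is

$$\tilde{\mathcal{P}}_1 = \begin{bmatrix} I & 0\\ 0 & 0 \end{bmatrix}, \tag{112}$$

while

$$\tilde{\mathcal{P}}_0 = \begin{bmatrix} 0 & 0\\ 0 & I \end{bmatrix}.$$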


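It will also be convenient to write

$$U^T \Gamma_n = \tilde{\Gamma}_n = \begin{bmatrix} x_n\\ y_n \end{bmatrix}, \tag{113}$$

$$U^T \mathcal{F} = \tilde{\mathcal{F}} = \begin{bmatrix} \phi\\ \psi \end{bmatrix}. \tag{110b}$$

Putting together the pieces just described in this section, the contrast between the mathematics of the exact theory and independence theory is as follows. The former problem is the following: solve for $x_n$ where

$$\begin{bmatrix} x_{n+1}\\ y_{n+1} \end{bmatrix} = \begin{bmatrix} \beta & \nu\\ \gamma & \delta \end{bmatrix} \begin{bmatrix} x_n\\ y_n \end{bmatrix} + \begin{bmatrix} \phi\\ \psi \end{bmatrix}, \tag{114}$$

where $x_0$ is given, $y_0 = 0$. The latter problem is: solve for $x_n$ where

$$x_{n+1} = \beta x_n + \phi, \tag{115}$$

$x_0$ is given. Note that if $\nu$ and $\gamma$ in (114) were zero, the solution to the two problems would be identical. Since $\nu$ and $\gamma$ vanish when $\alpha = 0$, we may hope a perturbation approach will be useful for small $\alpha$. More specifically, we treat

$$\begin{bmatrix} \beta & \nu\\ \gamma & \delta \end{bmatrix}$$

as a perturbation of (103), the matrix $\tilde{\mathcal{A}}(\alpha)$ when $\alpha = 0$.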
We begin by considering the eigenvalue problem for $\tilde{\mathcal{A}}(\alpha)$. When $\alpha = 0$, the eigenvalues of $\beta$ are unity while those of $\delta$ are zero, and these eigenvalues vary continuously as $\alpha$ is increased. Consider solving for the large eigenvalues. In general, we have to solve

$$\begin{aligned}
\beta x + \nu y &= \lambda x,\\
\gamma x + \delta y &= \lambda y,
\end{aligned} \tag{116}$$

where $\lambda$ is one of these eigenvalues, presumed close to one. Since the eigenvalues of $\delta$ will be presumed smaller than $\lambda$, $(\lambda - \delta)^{-1}$ exists and we conclude from the second equation of (116) that

$$y = (\lambda - \delta)^{-1}\gamma x.$$

Substituting this into the first equation yields

$$\left[\beta + \nu\,\frac{1}{\lambda - \delta}\,\gamma\right] x = \lambda x. \tag{117}$$

Consistent with the perturbation spirit, we replace the $\lambda$ (on the left) by 1 and $\delta$ by its value when $\alpha = 0$, namely, $J$ [see (103)]. Thus the large $\lambda$'s are (approximately) solutions to

$$\left(\beta + \nu\,\frac{1}{I - J}\,\gamma\right) x = \lambda x, \tag{118}$$

and the corresponding eigenvector to $\tilde{\mathcal{A}}(\alpha)$ is, approximately,†

$$\begin{bmatrix} x\\[2pt] \dfrac{1}{I - J}\,\gamma x \end{bmatrix}. \tag{119}$$
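The elimination leading to (117) is exact, and only the step to (118) is approximate. A small numerical sketch makes the distinction visible. The block values below are made-up illustrative numbers (weak coupling, $\delta$ close to a nilpotent Jordan block $J$), not quantities from the equalizer problem.

```python
import numpy as np

# Illustrative blocks: beta near the identity, delta near a nilpotent Jordan block J.
beta = np.array([[0.90, 0.02], [0.01, 0.85]])
nu = np.array([[0.03, 0.00, 0.01], [0.00, 0.02, 0.00]])
gamma = np.array([[0.02, 0.00], [0.00, 0.01], [0.01, 0.01]])
J = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]])
delta = J + np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.05, 0.0, 0.0]])

full = np.block([[beta, nu], [gamma, delta]])
lam = max(np.linalg.eigvals(full), key=abs)        # a "large" eigenvalue of the full matrix

# Exact reduced problem (117): lam must be an eigenvalue of beta + nu (lam - delta)^(-1) gamma.
reduced = beta + nu @ np.linalg.inv(lam * np.eye(3) - delta) @ gamma
residual = np.linalg.svd(reduced - lam * np.eye(2), compute_uv=False).min()

# Approximate problem (118): lam -> 1 and delta -> J on the left-hand side.
approx_eigs = np.linalg.eigvals(beta + nu @ np.linalg.inv(np.eye(3) - J) @ gamma)
```

Here `residual` vanishes to rounding error, while `approx_eigs` can be compared by eye with the large eigenvalues of `full` to see how much the replacement in (118) costs for a given coupling strength.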



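Using these approximations and applying the spectral decomposition discussed in (72) to evaluate $\tilde{\mathcal{A}}^n(\alpha)$, it is now straightforward to show that the desired solution to (114) is, at least if we neglect the small eigenvalues,

$$x_n = \left[\beta + \nu\,\frac{1}{I - J}\,\gamma\right]^n x_0 + \sum_{s=0}^{n-1} \left[\beta + \nu\,\frac{1}{I - J}\,\gamma\right]^s \left[\phi + \nu\,\frac{1}{I - J}\,\psi\right]. \tag{120}$$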



† With the present representation, the perturbation theory has been painless. More formal and more thorough approaches to perturbation theory of matrices may be found in Refs. 2 and 3.
From (120) we conclude

$$\lim_{n \to \infty} x_n = \left[I - \beta - \nu\,\frac{1}{I - J}\,\gamma\right]^{-1} \left[\phi + \nu\,\frac{1}{I - J}\,\psi\right]. \tag{121}$$

In fact, the steady-state error can also be computed exactly from (114) as

$$\left[I - \beta - \nu\,\frac{1}{I - \delta}\,\gamma\right]^{-1} \left[\phi + \nu\,\frac{1}{I - \delta}\,\psi\right]. \tag{122}$$

Within the spirit of our approximations, (122) is consistent with (121). The neglect of the small eigenvalues is justified by the fact that their contribution will damp out quickly, and also that they operate in a subspace approximately orthogonal to the one we are interested in. Thus in (119) the "second half" of the large eigenvector is small because of the $\gamma$ factor. The corresponding form for the "small" eigenvectors would have the first portion small.
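The exact steady-state expression (122) follows from setting $x_{n+1} = x_n$, $y_{n+1} = y_n$ in (114) and eliminating $y$. The sketch below checks it against brute-force iteration of (114); the block values and forcing vectors are made-up illustrative numbers chosen so that the iteration converges, and the function names are ours.

```python
import numpy as np

beta = np.array([[0.90, 0.02], [0.01, 0.85]])
nu = np.array([[0.03, 0.00, 0.01], [0.00, 0.02, 0.00]])
gamma = np.array([[0.02, 0.00], [0.00, 0.01], [0.01, 0.01]])
delta = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.05, 0.0, 0.0]])
phi = np.array([0.1, 0.2])
psi = np.array([0.3, -0.1, 0.2])

def steady_state_x():
    """x-part of the fixed point of (114), written as in (122)."""
    R = np.linalg.inv(np.eye(3) - delta)                  # (I - delta)^(-1)
    return np.linalg.solve(np.eye(2) - beta - nu @ R @ gamma, phi + nu @ R @ psi)

def iterate_x(x0, steps):
    """Run (114) from x_0 given, y_0 = 0, and return x after the requested steps."""
    x, y = x0.copy(), np.zeros(3)
    for _ in range(steps):
        x, y = beta @ x + nu @ y + phi, gamma @ x + delta @ y + psi
    return x
```

Replacing `delta` by its nilpotent part inside `steady_state_x` gives the approximate limit (121), and dropping the `nu` terms altogether gives the independence-theory value $(I - \beta)^{-1}\phi$.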

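We take (120) and (121) as our approximate solution. The terms

$$\nu\,\frac{1}{I - J}\,\gamma \quad \text{and} \quad \nu\,\frac{1}{I - J}\,\psi \tag{123}$$

are higher order terms in the perturbation, and neglecting them we obtain†

$$x_n = \beta^n x_0 + \sum_{s=0}^{n-1} \beta^s \phi, \tag{124}$$

$$x_\infty = \frac{1}{I - \beta}\,\phi, \tag{125}$$

exactly what independence theory would predict.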
To examine further the key expression

$$\beta + \nu\,\frac{1}{I - J}\,\gamma, \tag{126}$$

some more concrete expression for the $\nu\gamma$ type terms is needed. For example, consider an initial error matrix $R_0$. Then

$$R_1 = E\,[I - \alpha X_1X_1^T]\,R_0\,[I - \alpha X_1X_1^T]. \tag{127}$$

This must correspond to $\beta$ and so, as we already know,

$$\beta = E\,[I - \alpha X_1X_1^T] \otimes [I - \alpha X_1X_1^T]. \tag{128}$$

In general, then (neglecting the forcing terms), independence theory can be written (letting $P_n = I - \alpha X_nX_n^T$)

$$R_{n+1} = E\,P_1 R_n P_1. \tag{129}$$

† Noting that $(\lambda + \epsilon)^n \approx \lambda^n$ for $n = O(1/\epsilon)$ but not for $n \to \infty$, we expect the approximation to break down after a while. This may very well happen only after the taps have, for practical purposes, converged to the desired solution.

If we consider two iterations

$$R_2 = E\,[I - \alpha X_2X_2^T][I - \alpha X_1X_1^T]\,R_0\,[I - \alpha X_1X_1^T][I - \alpha X_2X_2^T], \tag{130}$$

this corresponds, on squaring the matrix in (102), to $\beta^2 + \nu\gamma$. Thus $\nu\gamma$ corresponds to

$$E\,[P_2P_1R_0P_1P_2 - P_\infty P_1R_0P_1P_\infty], \tag{131}$$

where $P_\infty$ is simply a notation denoting that it ($P_\infty$) is to be treated independently of $P_1$. The matrix $R_0$ is not statistical. The proper way to write (131) is†

$$\nu\gamma = E\,[(P_2 \otimes P_2)(P_1 \otimes P_1) - \beta^2]. \tag{132}$$

In general, it can be shown

$$\nu\,\frac{1}{I - J}\,\gamma = \sum_{s=0}^{M-2} \nu J^s \gamma = \sum_{s=2}^{M} \left[E\,(P_s \otimes P_s)(P_1 \otimes P_1) - \beta^2\right]. \tag{133}$$

Using (133) in (126) provides us with the next correction to the eigenvalues by way of (118). Furthermore, (133) suggests a simplified "dynamics" for $R_n$, namely,

$$R_{n+1} = E\,P_1R_nP_1 + E\sum_{s}\left[P_{1+s}P_1R_{n-s}P_1P_{1+s} - P_\infty P_1R_{n-s}P_1P_\infty\right]. \tag{134}$$

A general discussion of these correction terms seems out of the question. In fact, the expectations are not trivial to do. Instead, we resort again to the simple model of (15), where $A = I$ and the error is tracked through $\operatorname{tr} R_n$, and set $N = 3$. For this case, we have been able to do the expectations and compute the eigenvalues of $\beta$ and $\beta + \nu[1/(I - J)]\gamma$. The eigenvalue results are given in Table II. Certainly, in this case the perturbation philosophy seems well justified.
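Two algebraic facts are used silently in passing from (127) to (133): the correspondence between the matrix form $PRP^T$ and the vector form $(P \otimes P)\,\mathrm{vec}(R)$, and the finite Neumann series for $(I - J)^{-1}$ when $J$ is nilpotent. The following sketch checks both on random matrices; the row-major `vec` convention and all variable names are our choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 3
P1 = rng.standard_normal((N, N))
P2 = rng.standard_normal((N, N))
R0 = rng.standard_normal((N, N))

def vec(X):
    """Row-major stacking, which pairs with np.kron as vec(A X B^T) = (A (x) B) vec(X)."""
    return X.reshape(-1)

# Matrix form of two iterations versus the Kronecker (vector) form used in (132).
matrix_form = vec(P2 @ P1 @ R0 @ P1.T @ P2.T)
kron_form = np.kron(P2, P2) @ np.kron(P1, P1) @ vec(R0)

# Mixed-product rule quoted in the footnote: (A (x) B)(C (x) D) = AC (x) BD.
mixed_lhs = np.kron(P2, P2) @ np.kron(P1, P1)
mixed_rhs = np.kron(P2 @ P1, P2 @ P1)

# Finite Neumann series behind the first equality of (133), for a block with J^(M-1) = 0.
M = 4
J = np.diag(np.ones(M - 2), k=1)                      # (M-1) x (M-1) nilpotent Jordan block
neumann = sum(np.linalg.matrix_power(J, s) for s in range(M - 1))
inverse = np.linalg.inv(np.eye(M - 1) - J)
```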

† Using $(A \otimes B)(C \otimes D) = AC \otimes BD$, other forms are, of course, possible.

Table II. A comparison of the eigenvalues of $\beta$ and its perturbation for a special situation

$\beta$        $\beta + \nu\frac{1}{I - J}\gamma$
0.667      0.674
0.555      0.543
0.555      0.555
0.555      0.555
0.333      0.337
0.333      0.333
0.333      0.333
0.333      0.333
0.333      0.333

VII. CONCLUSIONS

We conclude (as explained above) that a perturbation analysis suggests that the difference between independence theory and one which takes into account the correlations between the "gradient" directions is slight. Our early worry was that the shifting property

$$\begin{bmatrix} x_1\\ x_2\\ \vdots\\ x_n \end{bmatrix} \to \begin{bmatrix} x_2\\ x_3\\ \vdots\\ x_{n+1} \end{bmatrix} \tag{135}$$

in going from one gradient direction to the next could cause trouble with independence theory. Any notion that this particular dependence must result in mathematics completely foreign to that of independence theory has been shown to be false. Independence theory is an inherent part of the exact description.

The situation in (135) does, however, have the rigorous property that the "new" component ($x_{n+1}$) is independent of the others. For real problems, this situation may well be violated in certain cases of severe intersymbol interference. Examining the $N = 1$ case leads us to propose the following criterion to measure this dependence. Namely, if, in the synchronous case, the received pulse $h(t)$ [see (1)] is normalized so that $\sum_l h_l^2 = 1$, then we might expect

$$\sum_{j=1}^{N-1} \left| \sum_{l=-\infty}^{\infty} h_l\,h_{l+j} \right| \ll 1$$

to be a good measure of independence for the new component.

Our effort has been a long and tedious one, and our attempts to pull insights from complicated equations have sometimes been nonrigorous and no doubt occasionally colored by the previous experimental results and simulation results of others.$^{1}$ Thus, while the ultimate justification of independence theory must remain empirical, we hope that our efforts at least make mathematically plausible the successes of independence theory.

Finally, it is a pleasure to say that the present work has benefited from discussions with J. Salz, L. A. Shepp, N. J. A. Sloane, and H. S. Witsenhausen.



APPENDIX A 

Evaluation of Some Averages 

For the purposes of this appendix, we drop the superscript in (11), labeling things as if $n = 1$. For application to (31) we consider the average

$$E\,XX^TRXX^T \tag{136}$$

for an arbitrary $N \times N$ matrix $R$. Here (11) holds, and we are averaging over the binary variables in $a$. Expanding (136) using (11), we have to do the key average

$$E\,aa^TQaa^T = E\,C, \tag{137}$$

where $Q = B^TRB$. Thus from (137),

$$(EC)_{ij} = E\sum_{k,l} (aa^T)_{ik}\,Q_{kl}\,(aa^T)_{lj} = E\sum_{k,l} a_ia_ka_la_j\,Q_{kl}. \tag{138}$$

Using the fact that for independent binary variables

$$E\,a_ia_ka_la_j = \delta_{ik}\delta_{lj} + \delta_{il}\delta_{kj} + \delta_{ij}\delta_{kl} - 2\,\delta_{ik}\delta_{il}\delta_{ij}, \tag{139}$$

we obtain, upon using (139) in (138),

$$(EC)_{ij} = Q_{ij} + Q_{ji} + (\operatorname{tr} Q)\,\delta_{ij} - 2\,Q_{ii}\,\delta_{ij}. \tag{140}$$

In matrix notation, (140) becomes

$$EC = Q + Q^T + (\operatorname{tr} Q)\,I - 2\operatorname{diag} Q, \tag{141}$$

with the definition

$$(\operatorname{diag} Q)_{ij} = Q_{ii}\,\delta_{ij}. \tag{142}$$

Note that if the $a_i$ were unit-variance Gaussian, the last term in (141) ($\operatorname{diag} Q$) would not arise. It will be dropped because it is small in usual cases. Finally, multiplying (141) on the left by $B$ and on the right by $B^T$ we recover (136), obtaining (since $Q$ is symmetric now)

$$E\,XX^TRXX^T = 2ARA + (\operatorname{tr} RA)\,A - 2B(\operatorname{diag} B^TRB)B^T. \tag{143}$$

Now we recall that all terms in (143) are multiplied by $\alpha^2$. We would neglect them all unless one can be large. In fact, $(\operatorname{tr} RA)A$ can be $N$ times larger and hence this is the only term we need keep.
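Because the symbols are binary, the averages (141) and (143) can be checked by exhaustive enumeration for small dimensions. The sketch below does this, assuming (as our reading of (11)) that $X = Ba$ with independent equiprobable $a_i = \pm 1$, so that $A = E\,XX^T = BB^T$; the function names and test matrices are ours.

```python
import itertools
import numpy as np

def fourth_moment(Q):
    """E[a a^T Q a a^T] by enumerating all a in {-1, +1}^m with equal probability."""
    m = Q.shape[0]
    total = np.zeros((m, m))
    for signs in itertools.product((-1.0, 1.0), repeat=m):
        a = np.array(signs)
        aaT = np.outer(a, a)
        total += aaT @ Q @ aaT
    return total / 2 ** m

def formula_141(Q):
    """Right member of (141): Q + Q^T + (tr Q) I - 2 diag Q."""
    m = Q.shape[0]
    return Q + Q.T + np.trace(Q) * np.eye(m) - 2 * np.diag(np.diag(Q))

def formula_143(R, B):
    """Right member of (143) for symmetric R, with A = B B^T."""
    A = B @ B.T
    Q = B.T @ R @ B
    return 2 * A @ R @ A + np.trace(R @ A) * A - 2 * B @ np.diag(np.diag(Q)) @ B.T
```

For symmetric $R$, `B @ fourth_moment(B.T @ R @ B) @ B.T` agrees with `formula_143(R, B)`, which is the statement (143).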
We move on to consider (32), rewritten as

$$-E\,\alpha\,(I - \alpha XX^T)\,\epsilon\,[c^{*T}XX^T - a_{n+J}X^T]. \tag{144}$$

The term linear in $\alpha$ in (144) vanishes as a consequence of (6), (7), and (9). One of the $\alpha^2$ terms is

$$\alpha^2\,E\,XX^T\epsilon\,c^{*T}XX^T. \tag{145}$$

Evaluating (145) using (143), we check to see if the dominant term can be large. It is given by

$$A\,\operatorname{tr}(\epsilon\,c^{*T}A) = A\,(\epsilon^TAc^*). \tag{146}$$

If we introduce the $(M - 1)$ vector $u$, having all zeros except a one in the $(1 + J)$ place, then

$$a_{n+J} = u \cdot a = u^Ta, \tag{147}$$

and the other $\alpha^2$ term is proportional to

$$\alpha^2\,E\,XX^T\epsilon\,u^Ta\,X^T = \alpha^2\,E\,B\,[aa^T(B^T\epsilon\,u^T)aa^T]\,B^T. \tag{148}$$

Evaluating (148) using (138) and (141), we get

$$A\,(\epsilon^TBu). \tag{149}$$

However, using (7) we readily verify $Bu = v$, and a final use of (9) shows that the two dominant $\alpha^2$ terms (146) and (149) cancel. The other terms are truly $\alpha^2$ terms (as opposed to $\alpha^2N$) and are neglected, leading us to replace (32) by zero.

We have introduced enough tricks now so that the reader may easily reproduce (33).



APPENDIX B 

Definiteness of Solution to (34) 

We give here an explicit demonstration that the solution to (34) retains its positive definite character. By induction on $n$, it is sufficient to show that $R^{(n+1)}$ is positive definite ($> 0$) if $R^{(n)}$ is.

We make repeated use of the fact that $R > 0$ if $R$ is hermitian and $\phi^TR\phi > 0$ for any nonzero vector $\phi$.

We recall that $A > 0$ (and is therefore hermitian), and hence $R^{(n+1)}$ is hermitian.

Rewrite the right member of (34) as

$$(I - \alpha A)R(I - \alpha A) + \alpha^2[A\,\operatorname{tr} AR - ARA] + \alpha^2\mathcal{E}^*A. \tag{150}$$

Each term in (150) is positive definite; the only nonobvious one is the second. However, it may be rewritten as

$$\sqrt{A}\,\left[\operatorname{tr}(\sqrt{A}\,R\,\sqrt{A})\,I - \sqrt{A}\,R\,\sqrt{A}\right]\sqrt{A}, \tag{151}$$

since $\operatorname{tr} AB = \operatorname{tr} BA$. The matrix $\sqrt{A}\,R\,\sqrt{A}$ is, of course, positive definite. Now observe that if $B > 0$ then $(\operatorname{tr} B)I - B > 0$, because the trace of $B$ exceeds its largest eigenvalue by the sum of the remaining (positive) eigenvalues. This concludes the proof.
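The rewriting (151) and the resulting definiteness of the second term of (150) can be spot-checked numerically. The sketch below uses randomly generated positive definite $A$ and $R$ (our stand-ins, with dimension at least 2), forms $\sqrt{A}$ from the eigendecomposition, and compares the two expressions; the function name and its arguments are ours.

```python
import numpy as np

def second_term_check(N=4, seed=3):
    """Return (max abs difference between the two forms, smallest eigenvalue of the term)."""
    rng = np.random.default_rng(seed)
    G1 = rng.standard_normal((N, N))
    G2 = rng.standard_normal((N, N))
    A = G1 @ G1.T + 0.1 * np.eye(N)          # random positive definite A
    R = G2 @ G2.T + 0.1 * np.eye(N)          # random positive definite R
    w, V = np.linalg.eigh(A)
    sqrtA = V @ np.diag(np.sqrt(w)) @ V.T    # symmetric square root of A
    Bm = sqrtA @ R @ sqrtA
    direct = A * np.trace(A @ R) - A @ R @ A                       # second term of (150)
    factored = sqrtA @ (np.trace(Bm) * np.eye(N) - Bm) @ sqrtA     # form (151)
    sym = (direct + direct.T) / 2
    return float(np.abs(direct - factored).max()), float(np.linalg.eigvalsh(sym).min())
```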